Towards Explainable Facial Emotion Recognition

Muhammad Usman Khan

Computer Vision, Explainable AI, Trustworthy AI, Bias Auditing

Abstract

Most facial emotion recognition (FER) systems are evaluated solely on accuracy. This project argues that accuracy alone is insufficient for responsible deployment and introduces a three-pillar evaluation framework combining performance metrics, Grad-CAM explainability, and demographic bias auditing. The pipeline fine-tunes a ResNet-50 on the RAF-DB dataset (7 emotion classes, ~15K in-the-wild images), achieving 84.35% test accuracy. Grad-CAM heatmaps reveal that correctly classified images show activations on anatomically meaningful facial regions, while misclassified images expose failure modes including background distraction and inter-class confusion. A DeepFace-powered bias audit uncovers a 9.40 percentage-point accuracy gap between perceived gender groups and a 7.87 percentage-point disparity across perceived racial groups, with intersectional analysis identifying specific subgroup–emotion combinations where the model underperforms most severely.

1. Introduction

Facial emotion recognition (FER) is a core task in affective computing with applications spanning healthcare, human-computer interaction, education, and security. While deep learning models have driven significant accuracy improvements on FER benchmarks, deployment in high-stakes contexts demands more than predictive performance: it requires transparency in decision-making and fairness across demographic groups.

This study addresses these gaps through a unified pipeline that (1) fine-tunes a ResNet-50 using transfer learning for competitive accuracy, (2) applies Grad-CAM to generate visual explanations mapping model attention to facial Action Units (AUs), and (3) conducts a systematic demographic bias audit using DeepFace to infer perceived gender and race, enabling disaggregated performance analysis. The RAF-DB dataset serves as the experimental benchmark due to its in-the-wild image diversity and crowd-sourced emotion annotations.

2. About the Dataset

The dataset utilized for this research is RAF-DB (Real-world Affective Faces Database) — a widely recognized benchmark containing approximately 15,000 facial images collected from the internet with crowd-sourced emotion labels across 7 basic emotion categories.

Dataset Overview

Split      Images
Training   12,271
Testing    3,068

Classes: 7 (Surprise, Fear, Disgust, Happiness, Sadness, Anger, Neutral)
Sample images from RAF-DB dataset showing diverse facial expressions across different subjects
Figure 1: Sample images from the RAF-DB dataset showing the 7 emotion categories

Class Distribution

The dataset exhibits significant class imbalance — Happiness dominates (~38% of training data) while Fear and Disgust are underrepresented (~2–5%). This imbalance directly impacts per-class performance and motivates the use of label smoothing during training.

Bar chart showing class distribution across 7 emotion categories in the RAF-DB training set
Figure 2: Class distribution in the RAF-DB training set

3. Research Methodology

We designed a pipeline that integrates transfer learning, explainability analysis, and fairness auditing to address the challenges of building trustworthy FER systems.

Pipeline Architecture

End-to-end pipeline diagram showing data flow from RAF-DB through ResNet-50, Grad-CAM, and DeepFace bias audit
Figure 3: End-to-end pipeline architecture

1. Data Preprocessing

All images were resized to 224×224 pixels to conform to ResNet-50's expected input dimensions. Training augmentations included random horizontal flip, slight rotation (±10°), colour jitter, and random erasing — standard techniques for regularizing on small-to-medium datasets. All images were normalized using ImageNet statistics (mean and standard deviation) to ensure compatibility with the pre-trained backbone.

2. Model Architecture

Given the limited size of RAF-DB (~12K training images), transfer learning was employed using a ResNet-50 pre-trained on ImageNet (V2 weights). Early layers (conv1 + layer1) were frozen to preserve low-level feature detectors, while layers 2–4 and the classification head were trainable.

A custom classification head was appended:

3. Training Strategy

Component        Setting
Optimizer        AdamW (lr=1e-4, weight_decay=1e-4)
Loss             Cross-entropy with label smoothing (0.1)
Scheduler        ReduceLROnPlateau (patience=3, factor=0.5)
Early stopping   Patience = 5 epochs
Batch size       64
Mixed precision  FP16 via torch.cuda.amp
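The settings in the table map onto PyTorch objects roughly as follows. This is a sketch under stated assumptions: the helper names make_training_objects and EarlyStopping are hypothetical, and both the scheduler and the early-stopping rule are assumed to monitor validation loss, which the table does not specify. Mixed precision is left out to keep the example device-independent.

```python
import torch
from torch import nn


def make_training_objects(model: nn.Module):
    """Build optimizer, loss, and scheduler from the table's settings."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    # mode="min" assumes the monitored quantity is validation loss.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", patience=3, factor=0.5
    )
    return optimizer, criterion, scheduler


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, scheduler.step(val_loss) and stopper.step(val_loss) would each be called once per epoch after validation.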

4. Results

The fine-tuned ResNet-50 reached 84.35% accuracy on the RAF-DB test set, suggesting that transfer learning from ImageNet provides sufficient inductive bias for in-the-wild FER even under constrained compute budgets.

Training Curves

Training and validation accuracy and loss curves over 30 epochs
Figure 4: Training & validation accuracy/loss curves

Final Model Performance

Per-Class Classification Report

Emotion    Precision  Recall  F1-Score  Support
Surprise   0.8179     0.8602  0.8385    329
Fear       0.7241     0.5676  0.6364    74
Disgust    0.5714     0.5250  0.5472    160
Happiness  0.9545     0.9038  0.9285    1185
Sadness    0.8259     0.8138  0.8198    478
Anger      0.7622     0.7716  0.7669    162
Neutral    0.7816     0.8735  0.8250    680
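The per-class report can be cross-checked against the headline accuracy with a few lines of arithmetic. The sketch below uses only the numbers in the table; the function and variable names are illustrative. Overall accuracy is recovered as the support-weighted sum of per-class recall, and macro-F1 as the unweighted mean of per-class F1.

```python
# (precision, recall, support) per class, copied from the report above.
REPORT = {
    "Surprise":  (0.8179, 0.8602, 329),
    "Fear":      (0.7241, 0.5676, 74),
    "Disgust":   (0.5714, 0.5250, 160),
    "Happiness": (0.9545, 0.9038, 1185),
    "Sadness":   (0.8259, 0.8138, 478),
    "Anger":     (0.7622, 0.7716, 162),
    "Neutral":   (0.7816, 0.8735, 680),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def summarize(report: dict) -> dict:
    total = sum(support for _, _, support in report.values())
    # recall * support gives the number of correct predictions in each class.
    correct = sum(round(recall * support) for _, recall, support in report.values())
    f1_values = [f1_score(p, r) for p, r, _ in report.values()]
    return {
        "total": total,
        "accuracy": correct / total,
        "macro_f1": sum(f1_values) / len(f1_values),
    }


summary = summarize(REPORT)
print(f"accuracy = {summary['accuracy']:.4f}, macro-F1 = {summary['macro_f1']:.4f}")
```

The recovered accuracy is 2588/3068 ≈ 0.8435, matching the reported 84.35%. The macro-F1 of about 0.766 is noticeably lower because Fear and Disgust count equally with the much larger Happiness class.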

Confusion Matrix

Confusion matrix showing predictions across 7 emotion classes
Figure 5: Confusion matrix — key confusions between Fear ↔ Surprise and Disgust → Anger/Sadness

5. Explainability with Grad-CAM

Grad-CAM (Selvaraju et al., 2017) computes the gradient of the predicted class score with respect to the feature maps of a target convolutional layer. These gradients are globally average-pooled to produce importance weights, which are used to generate a class-discriminative heatmap highlighting the input regions most influential for the prediction. We target layer4[-1] of ResNet-50 — the deepest convolutional block with the highest semantic abstraction.
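The weighting step can be written out in NumPy as a minimal sketch. It assumes the activations and gradients of the target layer have already been captured for one image (for a 224×224 input, layer4 of ResNet-50 yields 2048 maps of size 7×7); the function name grad_cam is illustrative. The ReLU follows the original Grad-CAM formulation, and upsampling the map to the input resolution for overlay is omitted.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM map for one image and one target class.

    activations: feature maps of the target layer, shape (K, H, W)
    gradients:   d(class score) / d(activations), same shape
    Returns an (H, W) map scaled to [0, 1].
    """
    if activations.shape != gradients.shape:
        raise ValueError("activations and gradients must have the same shape")

    # Global-average-pool the gradients: one importance weight per channel.
    weights = gradients.mean(axis=(1, 2))                        # shape (K,)

    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=([0], [0]))    # shape (H, W)

    # Keep only evidence that increases the class score.
    cam = np.maximum(cam, 0.0)

    # Scale to [0, 1] for visualisation; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

Channels whose average gradient is negative reduce the class score, so their contribution is suppressed by the ReLU rather than shown as attention.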

Correctly Classified Examples

Correctly classified images showed Grad-CAM activations concentrated on anatomically meaningful facial regions, which suggests that the model relies on emotion-relevant features rather than spurious correlations.

Grad-CAM heatmaps overlaid on correctly classified facial images showing activation on relevant facial regions
Figure 6: Grad-CAM heatmaps for correctly classified images

Misclassified Examples

Misclassified images revealed failure modes including background distraction, occlusion sensitivity, and inter-class confusion — actionable insights for model improvement.

Grad-CAM heatmaps overlaid on misclassified facial images showing attention on non-facial or irrelevant regions
Figure 7: Grad-CAM heatmaps for misclassified images

Grad-CAM Observations

Emotion    Grad-CAM Focus            Facial Action Units
Happiness  Mouth & cheeks            AU6 (cheek raiser) + AU12 (lip corner puller)
Surprise   Eyebrows & eyes           AU1+2 (brow raise) + AU5 (upper lid raise)
Anger      Brow & glabellar region   AU4 (brow lowerer)
Fear       Wide eye region           AU1+2+4+5+20 (wide-eyed tension)
Disgust    Nose & upper lip          AU9 (nose wrinkler) + AU10 (upper lip raiser)
Sadness    Lower face & eye corners  Diffuse; subtler cues
Neutral    Low activation            No strong discriminative region

6. Demographic Bias Audit

Since RAF-DB does not provide explicit demographic labels, we employ the DeepFace library (Serengil & Ozpinar, 2021) to infer perceived gender and race for a subset of 1,000 test images. Model accuracy is then disaggregated across these proxy labels to surface performance disparities. Because the labels are model-inferred rather than self-reported, the resulting gaps should be read as estimates.
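Once each audited image has a proxy label, the disaggregation itself is a simple group-by. The sketch below assumes the proxy labels are already available as plain strings (in this project they would come from DeepFace); the function names and the toy data are illustrative only.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy of the emotion classifier within each proxy demographic group."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        counts[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / counts[group] for group in counts}


def accuracy_gap_pp(per_group):
    """Gap between the best and worst group, in percentage points."""
    values = list(per_group.values())
    return 100.0 * (max(values) - min(values))


# Toy example with made-up labels: group A gets 2 of 4 right, group B gets 3 of 4.
y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 1, 0, 0, 0, 1, 2, 0]
groups = ["A"] * 4 + ["B"] * 4
per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group, accuracy_gap_pp(per_group))
```

Reporting the gap in percentage points (best group minus worst group) keeps it distinct from a relative difference, which would be a different number.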

Accuracy by Gender

Bar chart comparing model accuracy between perceived Men and Women groups
Figure 8: Accuracy by perceived gender, showing a 9.40 percentage-point gap between Men (82.1%) and Women (91.5%)

Accuracy by Race

Bar chart comparing model accuracy across perceived racial groups
Figure 9: Accuracy by perceived race, showing a 7.87 percentage-point disparity between the highest- and lowest-accuracy groups

Intersectional Analysis

The disaggregated analysis was extended to an intersectional level, examining accuracy across Gender × Emotion and Race × Emotion combinations. These heatmaps reveal that disparities are not uniform across emotions — certain subgroup–emotion combinations exhibit significantly lower accuracy, identifying priority targets for bias mitigation.

Heatmap showing accuracy for each Gender × Emotion combination
Figure 10: Intersectional accuracy — Gender × Emotion
Heatmap showing accuracy for each Race × Emotion combination
Figure 11: Intersectional accuracy — Race × Emotion
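The intersectional breakdown can be sketched as a two-key group-by over (proxy group, true emotion). With only 1,000 audited images spread across many cells, and rare classes such as Fear having just 74 test images overall, some cells will be very small. The sketch therefore masks cells below a minimum sample count; the min_support threshold and all names here are assumptions for illustration, not part of the original analysis.

```python
from collections import defaultdict


def intersectional_accuracy(y_true, y_pred, groups, min_support=10):
    """Accuracy per (group, true emotion) cell.

    Returns {(group, emotion): (accuracy or None, n)}; accuracy is None when the
    cell has fewer than `min_support` samples and is too small to interpret.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        cell = (group, truth)
        counts[cell] += 1
        hits[cell] += int(truth == pred)
    return {
        cell: (hits[cell] / n if n >= min_support else None, n)
        for cell, n in counts.items()
    }


# Toy example with made-up labels and a low threshold so the output is easy to read.
y_true = ["Fear", "Fear", "Happiness", "Happiness", "Fear"]
y_pred = ["Fear", "Surprise", "Happiness", "Happiness", "Fear"]
groups = ["A", "A", "A", "A", "B"]
cells = intersectional_accuracy(y_true, y_pred, groups, min_support=2)
for cell, (acc, n) in sorted(cells.items()):
    print(cell, acc, n)
```

Returning the sample count alongside each accuracy makes it easier to tell a genuine subgroup weakness from noise in a sparsely populated cell.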

7. Conclusion

This study presented an end-to-end pipeline for Explainable Facial Emotion Recognition using a ResNet-50 model fine-tuned on the RAF-DB dataset, combining quantitative evaluation, Grad-CAM-based explainability analysis, and proxy-label demographic bias auditing.

Key Contributions

  1. Model Performance: The fine-tuned ResNet-50 reached 84.35% accuracy on the RAF-DB test set, suggesting that transfer learning from ImageNet provides sufficient inductive bias for in-the-wild FER, even under constrained compute budgets.
  2. Explainability Insights: Correctly classified images showed Grad-CAM activations on anatomically meaningful regions (mouth for Happiness, brows for Anger), while misclassified images revealed failure modes including background distraction and occlusion sensitivity.
  3. Demographic Bias Findings: Disaggregated accuracy analysis revealed measurable performance disparities across perceived gender (9.40 percentage-point gap) and perceived racial groups (7.87 percentage-point gap). Intersectional analysis identified specific subgroup–emotion combinations where the model underperforms most severely.

Implications for Trustworthy AI

These findings reinforce the argument that accuracy alone is an insufficient metric for evaluating FER systems. Responsible deployment requires transparency (Grad-CAM or alternative XAI methods for every high-stakes prediction), fairness auditing (disaggregated evaluation across protected attributes), dataset diversification, and bias mitigation strategies.

Limitations & Future Work

Limitation                                            Potential Improvement
Demographic labels are inferred (not self-reported)   Use datasets with verified demographic metadata
Single architecture (ResNet-50)                       Compare with ViT, EfficientNet, DAN
Moderate overfitting (98.5% train vs 84.4% val)       Stronger regularization, larger dataset
Disgust/Fear classes underperform                     Class-balanced sampling, focal loss
Static explainability (Grad-CAM only)                 Add Attention Rollout, LIME, counterfactual explanations
Single dataset (RAF-DB)                               Cross-dataset validation on AffectNet, FER2013, ExpW

8. References